---
title: "What next? Modeling human behavior using smartphone usage data and (deep) recommender systems"
subtitle:
author: |
  | Simon Wiegrebe
date: "October 01, 2021"
output:
  beamer_presentation:
    # includes:
    #   in_header: head.tex
    toc: true
    slide_level: 2
    theme: "Goettingen"
    colortheme: "dolphin"
    fonttheme: "structurebold"
bibliography: bibliography.bib
biblio-style: myabbrvnat
header-includes: -
---
# Motivation

## Introduction

- Smartphone usage has become a valuable source of data in recent years:
  - large volume
  - ubiquitous
  - easily accessible
  - clean
  - representative of actual human behavior
- Behavioral researchers: investigating human behavioral traits through smartphone usage
- Most behavioral research: association between smartphone usage patterns and pre-established personality traits
- Here: data-centric approach to the modeling of human behavioral sequences
## Research Idea

- Smartphone usage data from a PhoneStudy project [@phonedata]
- There is a natural sequential order in the data:
  - An app session starts when the screen is switched on and ends when it is switched off
  - The apps used in between, ordered by their timestamps, together with the ON and OFF tokens, form the events of an app session
- Model behavioral sequences by means of next-event prediction
- Large number of possible events + sequential data \(\rightarrow\) use sequence-aware recommender system (RS) algorithms
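This session construction can be sketched in a few lines of Python; the `(timestamp, event)` log format is an illustrative assumption, not the actual PhoneStudy schema:

```python
# Sketch: grouping a raw usage log into app sessions delimited by
# ON/OFF screen events. The (timestamp, event) tuple format is a
# simplifying assumption, not the real PhoneStudy data layout.
def build_sessions(log):
    sessions, current = [], None
    for ts, event in sorted(log):
        if event == "ON":                 # screen on: session starts
            current = ["ON"]
        elif event == "OFF":              # screen off: session ends
            if current is not None:
                current.append("OFF")
                sessions.append(current)
                current = None
        elif current is not None:         # app usage within a session
            current.append(event)
    return sessions

log = [(1, "ON"), (2, "WhatsApp"), (3, "Chrome"), (4, "OFF"),
       (5, "ON"), (6, "OFF")]
print(build_sessions(log))
# [['ON', 'WhatsApp', 'Chrome', 'OFF'], ['ON', 'OFF']]
```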
# Data

## Description

- PhoneStudy dataset from a mobile sensing research project [@phonedata]
- 310 users, study period from October 29, 2017 through January 22, 2018
- Each app usage assigned an exact opening date and time
## App-level Representation

- In language modeling:
  - Tokens \(\widehat{=}\) words
  - Sentence \(\widehat{=}\) concatenation of tokens ending with a period
- Here:
  - Tokens \(\widehat{=}\) apps
  - Sentences \(\widehat{=}\) sessions
- Objective: next-app prediction
  - Predicting the next app a user is going to use in a given session
- Mostly very short sessions
## Sequence-level Representation

- How to address the issue of short session length?
  - Focus on behavior, not individual apps
- App-level sessions \(:=\) concatenations of app categories
  - These categories were pre-established by @stachl2020predicting
  - E.g.: “WhatsApp” \(\rightarrow\) “Messaging”
- Now:
  - Tokens \(\widehat{=}\) app-level sessions
  - Sentences \(\widehat{=}\) daily concatenations of a user’s sessions
- To avoid ambiguity, we use the terms “sequence” and “event”
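A minimal sketch of the app-to-category mapping and the resulting sequence-level token; the toy mapping and the "-"-joined token encoding are illustrative assumptions:

```python
# Sketch: mapping an app-level session to a single sequence-level
# token. The category mapping and the "-"-joined token encoding are
# toy assumptions; the real categories follow Stachl et al.
CATEGORY = {"WhatsApp": "Messaging", "Telegram": "Messaging",
            "Chrome": "Browser"}

def session_to_token(session):
    # ON/OFF (and unknown apps) are kept as-is
    return "-".join(CATEGORY.get(app, app) for app in session)

print(session_to_token(["ON", "WhatsApp", "Chrome", "OFF"]))
# ON-Messaging-Browser-OFF
```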
## Summary Statistics

- Drawback of sequence-level analysis: data size
  - Number of events at the sequence level \(\approx\) number of sequences at the app level
# Modeling

## Definitions and Terminology
- Baseline model \(:=\) non-NN-based model
- Session-based model: no incorporation of user-level information (user ID)
- Session-aware model: incorporation of user-level information
- \(s=(s_1, s_2, \dots, s_m)\): sequence of chronologically ordered events
- \(s_s\): last “known” event in the sequence
- \(s_{s+1}\): event we seek to predict
- \(i\): candidate event for \(s_{s+1}\)
## Session-based Baseline Models (I)

- AR (Association Rules) and SR (Sequential Rules) [@ludewig2018evaluation]
  - are based on co-occurrence frequencies
  - only take into account \(s_s\) when making a prediction
- AR
  - simply counts co-occurrences of \(s_s\) with \(i\)
  - normalizes this count by the number of all co-occurrences
- SR
  - accounts for sequential event order
  - only counts co-occurrences where \(s_s\) precedes any \(i\)
  - decreases the weight if other events occurred in between
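The counting behind these rule-based baselines can be sketched as follows; the geometric decay for in-between events is a simplified assumption, not the exact weighting of the cited work:

```python
from collections import defaultdict

def sequential_rule_scores(sequences, decay=0.5):
    """Count ordered co-occurrences a -> b, down-weighted by the
    number of events lying between a and b."""
    scores = defaultdict(float)
    for seq in sequences:
        for i, a in enumerate(seq):
            for j in range(i + 1, len(seq)):
                # weight 1.0 for adjacent pairs, shrinking with distance
                scores[(a, seq[j])] += decay ** (j - i - 1)
    return scores

rules = sequential_rule_scores([["a", "b", "c"], ["a", "c"]])
print(rules[("a", "b")], rules[("a", "c")])  # 1.0 1.5
```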
## Session-based Baseline Models (II)

- The neighborhood-based SKNN [@jannach2017recurrent]
  - defines a neighborhood of most similar past sequences
  - determines similarity between \(s\) and neighbor sequences
  - computes the score of \(i\) as the sum of similarity scores across all neighbor sequences which contain \(i\)
- STAN [@garg2019sequence] and VSTAN [@ludewig2021empirical] extend SKNN, for instance by
  - accounting for event recency in \(s\) using decay functions
  - accounting for sequence recency of neighbor sequences
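A minimal sketch of neighborhood-based scoring, using binary cosine similarity over event sets; the similarity measure and neighborhood size are illustrative choices, not the tuned configuration:

```python
def sknn_scores(current, past_sequences, k=100):
    """Score each candidate event by the summed similarity of the k
    most similar past sequences containing it (binary cosine here)."""
    cur = set(current)
    def sim(other):
        inter = len(cur & set(other))
        return inter / (len(cur) * len(set(other))) ** 0.5 if inter else 0.0
    # neighborhood: the k most similar past sequences
    neighbors = sorted(past_sequences, key=sim, reverse=True)[:k]
    scores = {}
    for seq in neighbors:
        s = sim(seq)
        for event in set(seq):
            scores[event] = scores.get(event, 0.0) + s
    return scores

scores = sknn_scores(["a", "b"], [["a", "b", "c"], ["d"]], k=1)
```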
## Session-based Neural Models

- GRU4Rec [@hidasi2015session]
  - initially one-hot encodes single input events
  - feeds input vectors into a Gated Recurrent Unit (GRU) layer
  - uses pairwise ranking losses for training
  - outputs, for each event, the likelihood of being next in the sequence
## Session-aware Neural Models

- HGRU4Rec [@quadrana2017personalizing]
  - is a user-aware extension of GRU4Rec
  - contains a short-term (session-level) and a long-term (user-level) memory GRU layer
  - generates recommendations for each event in a sequence through the session-level GRU (like GRU4Rec)
  - updates the additional user-level GRU at the end of each sequence
  - employs its hidden state to initialize the session-level GRU at the beginning of the next sequence
## Extensions

- We use (a combination of) 3 different heuristics for some session-based algorithms
- Extensions contribute user-level information from past sequences \(\rightarrow\) session-awareness
  - one heuristic prepends events from the user’s preceding sequence if \(s\) is short
  - another increases the score of \(i\) if \(i\) has occurred in the user’s past sequences
  - a third adds a reminder score to the original model score
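The score-boosting idea can be sketched as follows; the function name and the flat bonus weight are hypothetical, not the exact formulation we implement:

```python
# Sketch of the boost heuristic: candidates the user has interacted
# with before get a score bonus. Function name and weight are
# hypothetical, not the exact formulation used in the experiments.
def boost_scores(scores, user_history, weight=0.25):
    seen = {event for seq in user_history for event in seq}
    return {event: score + (weight if event in seen else 0.0)
            for event, score in scores.items()}

print(boost_scores({"Messaging": 0.5, "Browser": 0.25},
                   [["Browser", "Maps"]]))
# {'Messaging': 0.5, 'Browser': 0.5}
```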
## Implementation
- Implementation of models and extensions based on @latifi2021session
- We perform all modeling, evaluation, and analysis tasks in Python
# Evaluation

## Train-Validation-Test Split

- Time-ordered and user-clustered data
  - Standard time-agnostic cross-validation not applicable
- Last-event split method, applied twice:
  - Clip off each user’s last sequence \(\rightarrow\) test set
  - Clip off each user’s last sequence from the remaining data \(\rightarrow\) validation set
  - Each user required to have \(\ge 3\) sequences
- Additionally: split study period into 5 equally long sub-periods (windows)
  - Apply the train-validation-test split to all 5 windows
  - Average performance results across all 5 test sets
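The per-user last-event split can be sketched as follows; the dict-of-lists data layout is an illustrative assumption, not the project's actual code:

```python
# Sketch of the per-user last-event split: last sequence -> test,
# second-to-last -> validation, rest -> training.
def last_event_split(user_sequences, min_sequences=3):
    train, val, test = {}, {}, {}
    for user, seqs in user_sequences.items():
        if len(seqs) < min_sequences:
            continue  # drop users with too few sequences
        train[user], val[user], test[user] = seqs[:-2], seqs[-2], seqs[-1]
    return train, val, test

data = {"u1": [["a"], ["b"], ["c"], ["d"]], "u2": [["x"], ["y"]]}
train, val, test = last_event_split(data)
print(train, val, test)
# {'u1': [['a'], ['b']]} {'u1': ['c']} {'u1': ['d']}
```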
## Evaluation Protocol

- Evaluate predictions for all but the first event of each test sequence
- Preferable to
- How to define ground truth:
  - Mostly interested in predicting the single next action of a user
  - Our definition: only the event observed at the specific position counts as ground truth
## Evaluation Metrics (I)

- Target variable follows a multinomial distribution with a large number of categories
- We wish to quantify the goodness of our recommendation list of length \(k\)
- We wish to perform next-event prediction with our ground truth being a single event
- Let \(n\) be the total number of events to be predicted
## Evaluation Metrics (II)

Hit Rate (HR): \(HR@k\) is simply the fraction of events for which the corresponding recommendation list of length \(k\), \(rl(k)_i\), includes the ground truth, \(y_i\): \[\begin{align*}
HR@k &= \frac{\sum_{i=1}^n \mathbbm{1}_{rl(k)_i}(y_i)}{n}
\end{align*}\]

Mean Reciprocal Rank (MRR): \(MRR@k\) additionally accounts for the ranking within the recommendation list. \(MRR@k\) computes the reciprocal rank of the ground truth within the recommendation list, \(rr_i\), then averages this reciprocal rank across all \(n\) events: \[\begin{align*}
MRR@k &= \frac{\sum_{i=1}^n rr_i}{n}
\end{align*}\]
- We consider \(HR@k\) and \(MRR@k\) for \(k \in \{1,5,10,20\}\)
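A direct implementation of these two metrics, with \(rr_i = 0\) whenever the ground truth is outside the top-\(k\) list:

```python
def hr_mrr_at_k(recommendation_lists, ground_truth, k):
    """HR@k and MRR@k for single-event ground truths; rr_i is 0
    when y_i does not appear in the top-k recommendation list."""
    hits, rr_sum = 0, 0.0
    for rl, y in zip(recommendation_lists, ground_truth):
        top_k = rl[:k]
        if y in top_k:
            hits += 1
            rr_sum += 1.0 / (top_k.index(y) + 1)  # reciprocal rank
    n = len(ground_truth)
    return hits / n, rr_sum / n

recs = [["a", "b", "c"], ["b", "a", "c"]]
truth = ["b", "c"]
print(hr_mrr_at_k(recs, truth, k=2))  # (0.5, 0.25)
```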
## Tuning

- Simple random search with a budget of 100 configurations for each algorithm
- Hyperparameter search spaces as in @latifi2021session
- Tuning on the five-window data, then averaging performance across windows to determine the optimal hyperparameter configuration
- Tuning metric: \(HR@1\)
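Random search itself is straightforward to sketch; `evaluate` stands in for the actual train-and-evaluate routine, and the toy search space below is illustrative:

```python
import random

def random_search(evaluate, space, budget=100, seed=0):
    """Draw `budget` random configurations from `space` and keep the
    one with the highest HR@1 as returned by `evaluate`."""
    rng = random.Random(seed)
    best_cfg, best_hr = None, float("-inf")
    for _ in range(budget):
        # sample one value per hyperparameter
        cfg = {name: rng.choice(values) for name, values in space.items()}
        hr1 = evaluate(cfg)
        if hr1 > best_hr:
            best_cfg, best_hr = cfg, hr1
    return best_cfg, best_hr

# toy objective standing in for training and evaluating a model
space = {"decay": [0.1, 0.5, 0.9], "k": [50, 100, 500]}
best, hr = random_search(lambda cfg: cfg["decay"], space, budget=20)
```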
# App-level Results

## Minimum Sequence Length (I)

- Background:
  - The NN-based models employ RNNs
  - These learn from the present sequence, whereas non-neural methods mostly “look up” similar sequences or app combinations
  - App-level sequences are typically short \(\rightarrow\) RNN-based methods do not have much to “learn from”
- Hypotheses:
  - Better performance of NN-based models on longer sequences
  - No impact of sequence length on the performance of the non-neural baselines

\(\rightarrow\) Train and evaluate our models on a subset containing only sequences with at least 20 events.
## Minimum Sequence Length (II)

- The previous best performer remains best for \(HR@1\) and \(HR@5\)
- No large changes for the non-neural baselines
- Performance of NN-based models improves
## Minimum Sequence Length (III)

- What if instead we train on all sequences and only evaluate on long sequences?
- The same model remains the best performer
- All neural models perform considerably worse
  - Surprising because the full training dataset is considerably larger
- Conclusion: performance on long sequences benefits from training on long sequences only
## Position in Test Sequence (I)

- Initial performance boost for one of the models
- No clear trend for all other models
## Position in Test Sequence (II)

- Worse performance for NN-based models on later positions
- If training is not tailored towards them, NN-based models struggle with later positions in the prediction sequences and, consequently, with long prediction sequences
## Removing ON and OFF Events (I)
- Key issue and potential performance bottleneck: short sequence length
- ON and OFF events are hardly informative
- ON-OFF sequences make up \(38.91\%\) of all app-level sequences
- Effect of dropping all ON and OFF events from the app-level data?
## Removing ON and OFF Events (II)
- Improvements i.t.o. \(HR@1\) across the board
- Substantial improvements for neighborhood-based models
- Drawback: limited representativeness of results
## Category-level Prediction (I)
- Ultimate goal: predict human behavioral sequences \(\rightarrow\) consider next-category prediction instead of next-app prediction.
- For evaluation, simply consider app category: e.g., “messaging” instead of “WhatsApp”.
- If performance improves considerably: models learn more about behavioral sequences than previously thought
## Category-level Prediction (II)

- Performance increases especially for larger \(k\); gains are more pronounced for NN-based methods and proportional to app-level performance
## Embedding Analysis (I)

- Can deep learning models learn smartphone app semantics?
- Do apps from a common app category form clusters in the embedding space? \(\rightarrow\) Add an embedding layer (\(d=128\)) to the neural model
- Apply t-SNE [@hinton2002stochastic] to obtain two-dimensional app embeddings
## Embedding Analysis (II)

- No category-level clustering recognizable
- Only for \(11.67\%\) of apps is the most similar app (i.t.o. cosine similarity) from the same category
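The nearest-neighbor category check behind this figure can be sketched as follows, with toy embeddings and categories standing in for the learned ones:

```python
# Sketch of the nearest-neighbor check: share of apps whose most
# similar app (by cosine similarity) has the same category.
# Embeddings and categories below are toy values.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5

def same_category_share(embeddings, categories):
    hits = 0
    for app, vec in embeddings.items():
        nearest = max((other for other in embeddings if other != app),
                      key=lambda other: cosine(vec, embeddings[other]))
        hits += categories[nearest] == categories[app]
    return hits / len(embeddings)

emb = {"WhatsApp": [1.0, 0.1], "Telegram": [0.9, 0.2], "Chrome": [0.0, 1.0]}
cat = {"WhatsApp": "Messaging", "Telegram": "Messaging", "Chrome": "Browser"}
share = same_category_share(emb, cat)  # 2 of 3 apps match their neighbor
```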
## Embedding Analysis (III)

- Alternatively: start off with a data-driven clustering approach, k-means
- Look for potential accumulations of app categories within each cluster
## Embedding Analysis (IV)

- Moccasin-colored cluster: 32 out of 52 apps (>60%) are camera or image-editing apps
- However: the vast majority of clusters are dispersed across the app space, with little intra-cluster app category concentration
## Embedding Analysis (V)

- Experimentally construct app analogies such as “Messaging 1 + Social Network 1 - Social Network 2 = ???”
- We find no meaningful app analogies in our embeddings:
  - App analogies are conceptually much less intuitive than word analogies
  - Low overall quality of embeddings
# Sequence-level Results

## Removing ON-OFF Tokens (I)

- Suspiciously high \(HR@1\) performance across all algorithms
  - High prevalence of ON-OFF tokens (\(51.06\%\))
  - All algorithms predict ON-OFF tokens (almost) everywhere
  - Predictive performance on other tokens \(\approx 0\%\)
- Effect of removing ON and OFF events from the underlying app-level data?
## Removing ON-OFF Tokens (II)

- Performance drops for all algorithms, especially i.t.o. \(HR@1\)
- Clear best and worst performers
## Position in Test Sequence (I)

- ON and OFF events removed from the underlying app-level data
- No clear trend for any of the models
## Position in Test Sequence (II)

- All but one model perform better on later positions of the test sequences
- The precise positioning of the cutoff is not very relevant
## Position in Test Sequence (III)

- For NN-based models: the performance improvement for later events is in line with expectations
- Comparison of app- versus sequence-level data:
  - App-level setting: predominantly short sequences
  - Sequence-level setting: mostly long sequences
- Corroborates our previous conclusion: differences in sequence lengths between training and evaluation data negatively affect the performance of NN-based algorithms
# Discussion

## Conclusion (I)

- By and large, strong predictive performance of most algorithms
- NN-based models mostly perform well i.t.o. \(HR@1\) and \(HR@5\)
  - Amongst them, one model is often the weakest
- NN-based model performance is sensitive to sequence length and data size
- NN-based models are very expensive i.t.o. runtime and computational effort
- Simple, non-NN models are the preferable modeling choice for our data
## Conclusion (II)

- One baseline is recommendable i.t.o. \(HR@1\) and \(HR@5\) and requires no tuning
- Another exhibits strong performance i.t.o. \(HR@10\) and \(HR@20\) and is fast
- No overarching user-level effects in our data
  - For predicting future behavioral sequences of a particular user, it is not overly helpful to know this person’s past smartphone usage patterns
- User-level extensions mostly effective, especially for short sequences and early positions
  - not due to some profound user-level learning
  - instead, they address technical weaknesses of the session-based baseline algorithms
  - e.g., one extension alleviates the poor early-position performance of neighborhood-based models stemming from the low informational content of short sequences
## Limitations

- Dataset size: potentially giving a relative advantage to non-neural methods
- Algorithm selection: not including some modern, sophisticated approaches, e.g., BERT4Rec [@sun2019bert4rec]
  - Attention-based models require even more training data
  - Their main advantage is the better handling of long-range dependencies, while we mostly have short sequences
## Suggestions for Future Research

- Increased dataset size: new PhoneStudy dataset \(\rightarrow\) investigate impact of data size on (NN-based) model performance
- Information extraction: incorporation of duration, exact time of day, and geolocation of app usage
- Transfer learning: use of pre-trained transformers?
# References